meteharun: Implemented CUDA matrix multiplication #27

Open

meteharun wants to merge 1 commit into parallelcomputingabo:main from meteharun:meteharun

meteharun commented May 31, 2025

CUDA Matrix Multiplication - Submission by meteharun

What I did

Implemented two CUDA kernels:
- naive_cuda_matmul: a basic version that multiplies matrices without any optimization.
- tiled_cuda_matmul: an improved version using shared memory and tiling to make it faster.
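
For orientation, a naive kernel of this kind could look roughly like the sketch below. It is not the code from this PR: the parameter list, the row-major layout (A is m x n, B is n x p, C is m x p), the 16 x 16 block size, and the small test matrices in main are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical signature: A is m x n, B is n x p, C is m x p, all row-major.
// Each thread computes one element of C and reads its whole row of A and
// column of B straight from global memory.
__global__ void naive_cuda_matmul(const float *A, const float *B, float *C,
                                  int m, int n, int p) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < p) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k) {
            sum += A[row * n + k] * B[k * p + col];
        }
        C[row * p + col] = sum;
    }
}

int main() {
    const int m = 2, n = 3, p = 2;  // tiny sizes, chosen only for the demo
    float *A, *B, *C;
    // Managed memory keeps the example short; explicit cudaMemcpy works too.
    cudaMallocManaged(&A, m * n * sizeof(float));
    cudaMallocManaged(&B, n * p * sizeof(float));
    cudaMallocManaged(&C, m * p * sizeof(float));
    for (int i = 0; i < m * n; ++i) A[i] = float(i + 1);
    for (int i = 0; i < n * p; ++i) B[i] = 1.0f;

    dim3 block(16, 16);
    // Round the grid up so every element of C is covered.
    dim3 grid((p + block.x - 1) / block.x, (m + block.y - 1) / block.y);
    naive_cuda_matmul<<<grid, block>>>(A, B, C, m, n, p);
    cudaDeviceSynchronize();

    for (int r = 0; r < m; ++r) {
        for (int c = 0; c < p; ++c) printf("%6.1f ", C[r * p + c]);
        printf("\n");
    }
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

In this form every thread performs 2n global loads for its dot product, which is the traffic the tiled version is meant to reduce.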

Optimizations

- Used shared memory to store tiles of A and B, which reduces the number of slow global memory accesses.
- Applied tiling (block-level matrix multiplication), so threads in the same block work together on small submatrices.
- Added synchronization between threads so that a shared-memory tile is fully loaded before it is read and fully used before it is overwritten.
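
The three points above can be sketched as a single kernel. This is a minimal illustration under assumptions, not the PR's actual code: the TILE width of 16, the parameter list, the row-major layout, and the launch_tiled wrapper are all hypothetical.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; the block must be TILE x TILE threads

// Hypothetical signature: A is m x n, B is n x p, C is m x p, all row-major.
__global__ void tiled_cuda_matmul(const float *A, const float *B, float *C,
                                  int m, int n, int p) {
    // One tile of A and one tile of B, shared by all threads in the block.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Walk along the shared dimension n one tile at a time.
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;

        // Each thread loads one element of each tile; out-of-range
        // positions are padded with zeros so they do not affect the sum.
        As[threadIdx.y][threadIdx.x] =
            (row < m && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < n && col < p) ? B[b_row * p + col] : 0.0f;

        __syncthreads();  // both tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }

        __syncthreads();  // all reads done before the next iteration overwrites
    }

    if (row < m && col < p) {
        C[row * p + col] = sum;
    }
}

// Hypothetical host-side wrapper; d_A, d_B, d_C are device pointers.
void launch_tiled(const float *d_A, const float *d_B, float *d_C,
                  int m, int n, int p) {
    dim3 block(TILE, TILE);
    dim3 grid((p + TILE - 1) / TILE, (m + TILE - 1) / TILE);
    tiled_cuda_matmul<<<grid, block>>>(d_A, d_B, d_C, m, n, p);
}
```

Both __syncthreads() calls sit outside any branch, so every thread in the block reaches them. Each value loaded from global memory is reused TILE times from shared memory, which is where the reduction in global accesses comes from.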

Challenges

- CUDA compatibility issues on Puhti.
- Reusing variable names such as cudaEvent_t start for the timing of both kernels caused compilation errors, so the declarations had to be kept distinct.
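
One way to sidestep the duplicate cudaEvent_t declarations is to keep the events local to a helper function, so each timed launch gets its own start and stop. The sketch below assumes this approach; time_ms and the scale kernel are hypothetical names used only for illustration and do not come from this PR.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the naive and tiled kernels.
__global__ void scale(float *x, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

// Hypothetical helper: times whatever the callable launches, in milliseconds.
// start and stop are local, so calling it twice never redeclares them.
template <typename Launch>
float time_ms(Launch launch) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the timed work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Two separately timed launches, with no shared event variables.
    float first_ms  = time_ms([&] { scale<<<blocks, threads>>>(d_x, n, 2.0f); });
    float second_ms = time_ms([&] { scale<<<blocks, threads>>>(d_x, n, 0.5f); });

    printf("first launch:  %.3f ms\n", first_ms);
    printf("second launch: %.3f ms\n", second_ms);

    cudaFree(d_x);
    return 0;
}
```

Wrapping each measurement in its own block scope, or giving the events distinct names per kernel, would avoid the redeclaration error equally well.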

Results

Added a table to README.md showing the timing results for both CUDA versions and the speedup of the tiled version over the naive kernel and the CPU version.


meteharun: Implemented CUDA matrix multiplication (commit 6d4f94f)
